Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for non-UTF-8 strings #46

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

hanmertens
Copy link

This is a start towards allowing arguments that aren't UTF-8. At the moment there's just the groundwork to show that argument parsing is also possible based on &OsStr. It's not finished yet, but I'd like to get some feedback and discuss some design decisions beforehand. The API of gumdrop itself has to change quite a bit, but I think users that just use derive(Options) shouldn't need to notice much of the change.

Related to #15.

What becomes &OsStr or AsRef<OsStr>?

The parser accepts now AsRef<OsStr> instead of AsRef<str>, but normal string (slices) implement this too, so that isn't too big of a change. All commands, short/long options stay &str, but free arguments and arguments to long options become &OsStr before being parsed in gumdrop_derive.

The functions gumdrop::{parse_args, parse_args_default} currently accept &[String]. I think this either can be kept the same but accepting &[S] where S: AsRef<OsStr> or two new functions gumdrop::{parse_args_os, parse_args_os_default} that accept &[OsString] can be made.

In gumdrop_derive the parsing options based on &str can remain, and the arguments can be converted from &OsStr to &str first, yielding an error if that's not possible (without this PR, this would yield a panic). For this purpose I made a doc-hidden but public function gumdrop::to_str, I'm not sure if that's the best approach. In addition to parsing based on &str, parsing options based on &OsStr can be added (basically the whole point of this PR):

  • parse(from_os_str = "...") for fn(&OsStr) -> T
  • parse(try_from_os_str = "...") for fn(&OsStr) -> Result<T, E> where E: Display
  • parse(from_os_str) uses std::convert::From::from
  • I don't think parse(try_from_str) has a good OsStr equivalent

Limitations

If available, platform-specific OsStrExt is used to look ahead and parse arguments starting from the beginning. Otherwise, the only API available to get the inner data of the OsStr is to_string_lossy, which checks the entire string for valid UTF-8 and allocates a String if that isn't the case, replacing invalid UTF-8 with the replacement character. This means an option like --valid-utf8=invalid-utf8 cannot be parsed into LongWithArg(&str, &OsStr), because there is no way to shorten the &OsStr to invalid-utf8 while keeping the invalid UTF-8 characters. However, --valid-utf8 invalid-utf8 (with a space instead of equal sign) can be parsed correctly on all platforms, and most platforms (especially in terms of usage) implement a form of OsStrExt.

At the moment I crudely check with cfg(unix) for unix's version of OsStrExt, but this should be extended to other platforms that export the same OsStrExt but not all other unix stuff.

Windows has a different OsStrExt trait, which cannot turn --valid-utf8=invalid-utf8 into LongWithArg(&str, &OsStr), but can turn it into LongWithArg(Cow<str>, Cow<OsStr>). This is because Windows uses 16-bit code points internally, which translates to that you can't shorten a &OsStr, but you can create a new OsString that contains only part of the original &OsStr.

This paves the way for deriving fields in Options structs directly from
OsStr, e.g. allowing for non-UTF-8 file paths. Commands and short/long
options still need to be valid UTF-8.
This requires Opt::LongWithArg to use Cow (or (Os)String), &OsStr cannot
be shortened on Windows. This in turn means the enum cannot be Copy, and
should therefore be passed by reference whenever possible.
This does not happen automatically when a later function is
generic (e.g. From::from).
Three types of parsing support that have &str counterparts:
from_os_str (named and unnamed) and try_from_os_str (named). The fourth
type of &str-based parsing (unnamed try_from_str using
std::str::FromStr) does not have an &OsStr counterpart.
The same OsStrExt is present on some platforms that are not considered
to be in the Unix family. Using more fine-grained cfg-attributes, these
platforms can also support --valid-utf8=invalid-utf8. I have verified
that it compiles for the x86_64-fortanix-unknown-sgx and wasm32-wasi
targets, but not that it runs correctly.
@hanmertens
Copy link
Author

I think this PR is mostly finished, with the following caveats:

  • Options like --valid-utf8=invalid-utf8 work on most platforms (Unix, Windows, and some more niche platforms), but not all.
  • Support for --valid-utf8=invalid-utf8 for Fortanix SGX depends on the cfg_target_vendor feature, which was stabilized in Rust 1.33.0. Currently the gumdrop appears to work with Rust versions as low as 1.31.0, so this would either be raised to 1.33.0 or support for --valid-utf8=invalid-utf8 needs to be dropped for Fortanix SGX.

@hanmertens hanmertens marked this pull request as ready for review February 17, 2021 22:38
jirutka added a commit to jirutka/gumdrop that referenced this pull request Jan 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant